Foundations of Geospatial Analysis

Adam Dennett

Centre for Advanced Spatial Analysis, University College London

About Me

  • Professor of Urban Analytics & Current Head of Department @ Bartlett Centre for Advanced Spatial Analysis (CASA), UCL

  • Geographer by background - ex-Secondary School Teacher - back in HE for 15+ years

  • Taught GIS / Spatial Data Science at postgrad level for last 10 years

About this session

  • Whistle-stop tour of some of the key concepts relating to spatial data

  • An illustrative example analysing some spatial data in London - demonstrating the “spatial is special” idiom and how we might account for spatial factors in our analysis

  • I hope you’ll all leave with a better understanding of how we should pay attention to the influence of space in any analysis

Key Geospatial Concepts

  • Where? (absolute)
  • Where? (relative)
  • How near or distant?
  • What scale?
  • What shape?

Where? (absolute)

  • Everything happens somewhere

    • We’re here: Wallspace, 22 Duke’s Road, Camden, London, England, *Europe, Northern Hemisphere, Earth

Where? (absolute)

  • How do we know exactly where?

XKCD - No, The Other One

https://xkcd.com/2480/

Where? Coordinate Reference Systems

  • Coordinates are more reliable than names, which are rarely unique and often refer to fuzzy locations

  • The earth is roughly spherical and points anywhere on its surface can be described using the World Geodetic System (WGS) - a geographic (spherical) coordinate system

  • Points can be referenced according to their position on a grid of latitudes (degrees north or south of the equator) and longitudes (degrees east or west of the Prime - Greenwich - meridian)

  • The last major revision of the World Geodetic System was in 1984 and WGS84 is still used today as the standard system for referencing places on the globe.
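
One practical consequence of working in latitude and longitude is that a degree is not a fixed distance on the ground. The sketch below is a minimal illustration, assuming a perfectly spherical Earth with a mean radius of 6371 km (WGS84 itself uses an ellipsoid, so these figures are approximate). The function name `haversine_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude is the same length everywhere on a sphere...
one_deg_lat = haversine_km(51.0, 0.0, 52.0, 0.0)
# ...but one degree of longitude shrinks as we move away from the equator.
one_deg_lon_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
one_deg_lon_london = haversine_km(51.5, 0.0, 51.5, 1.0)

print(f"1 degree of latitude:              {one_deg_lat:.1f} km")
print(f"1 degree of longitude at equator:  {one_deg_lon_equator:.1f} km")
print(f"1 degree of longitude at 51.5 N:   {one_deg_lon_london:.1f} km")
```

This is one reason distance-based analysis is usually done in a projected coordinate system, where coordinates are in metres.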

https://www.earthdatascience.org/courses/use-data-open-source-python/intro-vector-data-python/spatial-data-vector-shapefiles/geographic-vs-projected-coordinate-reference-systems-python/

Where? Coordinate Reference Systems

  • Projected Coordinate Reference Systems convert the 3D globe to a 2D plane and can do so in a huge variety of different ways

  • Most national mapping agencies have their own projected coordinate systems - in Britain the Ordnance Survey maintain the British National Grid which locates places according to 6-digit Easting and Northing coordinates

  • Most coordinate systems in common use can be referenced by an EPSG code, e.g. WGS84 = EPSG:4326 and British National Grid = EPSG:27700, and mathematical transformations convert coordinates from one system to another

Where? Describing and Locating Things with Coordinates

  • Once we have a coordinate reference system we can locate objects accurately in space

  • Most objects that spatial data scientists are concerned with (apart from gridded representations, which we will ignore for now!) can be simplified to either a point, a line or a polygon in that space

  • Polygons and lines are just multiple point coordinates joined together!
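
To make the point, line and polygon idea concrete, here is a minimal sketch in plain Python. All coordinates are hypothetical values in metres on a projected grid, not real locations, and the helper names are made up for illustration. Real work would normally use a geometry library rather than hand-written formulas.

```python
import math

# Hypothetical coordinates (easting, northing) in metres
point = (1000.0, 2000.0)
line = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
# A polygon is a closed ring of points; the first vertex is not repeated here
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]


def line_length(coords):
    """Sum of the straight-line segment lengths between consecutive points."""
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(coords, coords[1:]))


def polygon_area(ring):
    """Area of a simple polygon using the shoelace formula."""
    n = len(ring)
    twice_area = sum(ring[i][0] * ring[(i + 1) % n][1]
                     - ring[(i + 1) % n][0] * ring[i][1]
                     for i in range(n))
    return abs(twice_area) / 2


def polygon_centroid(ring):
    """Centroid of a simple polygon (area-weighted, not the vertex average)."""
    n = len(ring)
    twice_area = cx = cy = 0.0
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3 * twice_area), cy / (3 * twice_area))


print(line_length(line))          # 11.0
print(polygon_area(polygon))      # 12.0
print(polygon_centroid(polygon))  # (2.0, 1.5)
```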

Where? Relative - Tobler’s First Law of Geography

“Everything is related to everything else, but near things are more related than distant things.”

  • This observation underpins much of what spatial data scientists do

  • Being able to locate something in space allows us to:

    • explain why something may be occurring where it is

    • make better predictions about nearby or further away things

  • Underpins the whole Geodemographics (customer segmentation) industry!!

Where? Relative - Defining ‘near’ and ‘distant’

  • Near and distant can mean different things in different contexts

    • the furthest one would travel to buy a pint of milk is somewhat different to the furthest one might be willing to commute for a job
  • In spatial data science one way of separating near from distant can simply be to define their topological relationship - the Dimensionally Extended 9-Intersection Model (DE-9IM) is the standard topological model used in GIS

  • Touching or overlapping objects = ‘near’

Where? Relative - Exploring Near and Distant

  • Near and distant in London

Where? Relative - Exploring Near and Distant

  • If we measure the distance from the centre (centroid) of one ward to another, then we might decide that the 1st, 2nd, 3rd, ..., kth closest wards are near and all the others are distant.

Where? Relative - Exploring Near and Distant

  • We can then decide to include the “k” nearest neighbours and exclude the rest
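
The k-nearest-neighbour idea can be sketched in a few lines of plain Python. The ward names and centroid coordinates below are hypothetical, and `k_nearest` is an illustrative helper rather than a function from any library.

```python
import math

# Hypothetical ward centroids (easting, northing) in metres
centroids = {
    "A": (0.0, 0.0),
    "B": (1000.0, 0.0),
    "C": (0.0, 1500.0),
    "D": (3000.0, 3000.0),
    "E": (1200.0, 800.0),
}


def k_nearest(centroids, k):
    """For each ward, return the names of its k closest wards by centroid distance."""
    neighbours = {}
    for name, (x, y) in centroids.items():
        distances = sorted(
            (math.hypot(x - ox, y - oy), other)
            for other, (ox, oy) in centroids.items()
            if other != name
        )
        neighbours[name] = [other for _, other in distances[:k]]
    return neighbours


knn = k_nearest(centroids, k=2)
for ward, nearest in knn.items():
    print(ward, "->", nearest)
```

Note that k-nearest neighbours is not a symmetric relationship: in this toy layout the outlying ward D has E and C as its neighbours, but no other ward counts D among its own two nearest.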

Where? Relative - Exploring Near and Distant

  • Other conceptions of near might include any contiguous ward with distant simply being those which are not contiguous

  • Near or distant could also be defined by some distance threshold
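
Both alternatives can be sketched in plain Python. The example assumes hypothetical centroids in metres and toy square "wards" whose shared borders use identical vertices, which is how the contiguity check below detects touching polygons. The function names are illustrative only.

```python
import math
from itertools import combinations

# Hypothetical ward centroids (easting, northing) in metres
centroids = {
    "A": (0.0, 0.0), "B": (1000.0, 0.0), "C": (0.0, 1500.0),
    "D": (3000.0, 3000.0), "E": (1200.0, 800.0),
}

# Hypothetical wards described by the set of vertices on their boundary
ward_vertices = {
    "W1": {(0, 0), (1, 0), (1, 1), (0, 1)},
    "W2": {(1, 0), (2, 0), (2, 1), (1, 1)},
    "W3": {(1, 1), (2, 1), (2, 2), (1, 2)},
    "W4": {(5, 5), (6, 5), (6, 6), (5, 6)},
}


def distance_band(centroids, threshold):
    """Wards are neighbours if their centroids are within `threshold` metres."""
    nb = {name: [] for name in centroids}
    for a, b in combinations(centroids, 2):
        (ax, ay), (bx, by) = centroids[a], centroids[b]
        if math.hypot(ax - bx, ay - by) <= threshold:
            nb[a].append(b)
            nb[b].append(a)
    return nb


def contiguity(wards, min_shared=1):
    """Wards are neighbours if they share at least `min_shared` boundary vertices.

    min_shared=1 counts corner contacts too (queen-style); min_shared=2
    requires a shared edge in this toy set-up (rook-style).
    """
    nb = {name: [] for name in wards}
    for a, b in combinations(wards, 2):
        if len(wards[a] & wards[b]) >= min_shared:
            nb[a].append(b)
            nb[b].append(a)
    return nb


print(distance_band(centroids, threshold=1600.0))
print(contiguity(ward_vertices))                # queen-style
print(contiguity(ward_vertices, min_shared=2))  # rook-style
```

Unlike k-nearest neighbours, both definitions are symmetric, and both can leave a ward with no neighbours at all (D and W4 here).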

Spatial Analysis of ‘where’?

  • Where in London do students perform best and worst in their GCSE exams (taken at age 16)?

Is there any pattern? Do better scores and worse scores appear to be clustered? How can we tell?

Spatial Autocorrelation

  • Spatial Autocorrelation - phenomenon of near things being more similar than distant things.

    • Do neighbouring wards have more similar GCSE points scores than distant wards?
  • Can test for spatial autocorrelation by comparing the GCSE scores in any given ward with the GCSE scores in neighbouring wards (however we choose to define our neighbours - k-nearest, those that are contiguous etc.)

  • Average value of GCSE scores in the neighbouring wards is known as the spatial lag of GCSE scores
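
A spatial lag is simply a neighbourhood average, as the following minimal sketch shows. The scores and neighbour lists are hypothetical, and `spatial_lag` is an illustrative helper, not a library function.

```python
# Hypothetical average GCSE scores for four wards arranged in a row
scores = {"A": 300.0, "B": 320.0, "C": 340.0, "D": 360.0}

# Hypothetical neighbour lists (e.g. from contiguity)
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}


def spatial_lag(values, neighbours):
    """Mean of the neighbouring values for every ward that has neighbours."""
    return {
        ward: sum(values[n] for n in nbs) / len(nbs)
        for ward, nbs in neighbours.items()
        if nbs  # wards without neighbours have no defined lag
    }


lag = spatial_lag(scores, neighbours)
for ward in scores:
    print(ward, scores[ward], lag[ward])
```

Taking the plain mean of the neighbours corresponds to giving each of a ward's neighbours an equal weight that sums to one, usually called row-standardised weights.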

Spatial Autocorrelation

  • If there is a linear correlation between the variable and its spatial lag, similar values tend to cluster in nearby places - here, regressing the spatial lag on the GCSE scores gives a slope of about 0.42
                          (Intercept) average_gcse_capped_point_scores_2014 
                          190.2624075                             0.4190508 

Moran’s I

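  • With row-standardised weights, Moran’s I equals the slope of the least-squares regression of the spatial lag on the variable itself
  • Values usually range from +1 (perfect positive spatial autocorrelation, i.e. clustering) to -1 (perfect dispersion), with values close to 0 (strictly, the expectation -1/(n-1)) indicating no spatial autocorrelation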
moran.test(LondonWardsMerged$average_gcse_capped_point_scores_2014, nb2listw(LWard_nb))

    Moran I test under randomisation

data:  LondonWardsMerged$average_gcse_capped_point_scores_2014  
weights: nb2listw(LWard_nb)    

Moran I statistic standard deviate = 17.785, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.4190507533     -0.0016025641      0.0005594495 
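
The statistic reported above can be reproduced on toy data in plain Python. The sketch below is not the spdep implementation: it assumes row-standardised weights (each ward's neighbours share a total weight of one) and six hypothetical units arranged in a chain. It also checks that Moran's I then matches the slope from regressing the spatial lag on the variable.

```python
def morans_i(values, neighbours):
    """Moran's I with row-standardised weights; neighbours[i] lists indices of i's neighbours."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    numerator = 0.0
    weight_sum = 0.0
    for i, nbs in enumerate(neighbours):
        for j in nbs:
            w = 1.0 / len(nbs)  # each neighbour of i gets an equal share
            numerator += w * z[i] * z[j]
            weight_sum += w
    return (n / weight_sum) * numerator / sum(zi * zi for zi in z)


def lag_slope(values, neighbours):
    """Least-squares slope from regressing the spatial lag on the variable."""
    lag = [sum(values[j] for j in nbs) / len(nbs) for nbs in neighbours]
    mean_x = sum(values) / len(values)
    mean_lag = sum(lag) / len(lag)
    sxy = sum((x - mean_x) * (l - mean_lag) for x, l in zip(values, lag))
    sxx = sum((x - mean_x) ** 2 for x in values)
    return sxy / sxx


# Six hypothetical units in a chain: 0-1-2-3-4-5
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
clustered = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0]   # low values together, high values together
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]    # every neighbour is the opposite

print(morans_i(clustered, chain))    # positive: similar values sit together
print(lag_slope(clustered, chain))   # same number as Moran's I
print(morans_i(alternating, chain))  # -1: perfect dispersion
```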

Moran’s I

  • Moran’s I = 0.42

  • Moderate, positive spatial autocorrelation between average GCSE scores in London - some clustering of both low and high scores

Explaining Spatial Patterns

  • Having observed some spatial patterns in school exam performance in London, we might next want to explain these patterns, perhaps using another variable measured for the same spatial units.

  • Our own experience might tell us that missing class could negatively impact our ability to learn things in that class

  • Hypothesis: wards where there are higher rates of absence from school might tend to experience lower average exam grades

Explaining Spatial Patterns

  • Taking the whole of London, it would appear that there is a moderately strong, negative relationship between missing school and exam performance

  • For every additional percentage point of school days missed through unauthorised absence, we might expect the average GCSE score to fall by about 41 points.

  • But does this relationship hold true across all wards in the city?

                                     (Intercept) 
                                       371.71500 
unauthorised_absence_in_all_schools_percent_2013 
                                       -41.40264 
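
The coefficients above can be read as a simple prediction rule. The sketch below plugs the reported intercept and slope into a straight line. The absence rates passed in are arbitrary illustrative values, and `predicted_gcse` is a hypothetical helper name.

```python
# Coefficients as reported in the regression output above
INTERCEPT = 371.71500
SLOPE = -41.40264  # change in GCSE points per percentage point of unauthorised absence


def predicted_gcse(absence_percent):
    """Predicted average GCSE capped point score for a given absence rate (%)."""
    return INTERCEPT + SLOPE * absence_percent


for absence in (0.5, 1.0, 2.0):  # illustrative absence rates, not observed wards
    print(f"{absence:.1f}% absence -> {predicted_gcse(absence):.1f} points")

# Moving from 1% to 2% absence changes the prediction by exactly the slope
difference = predicted_gcse(2.0) - predicted_gcse(1.0)
print(f"difference: {difference:.2f}")
```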

Explaining Spatial Patterns

  • The Moran’s I of GCSE scores already tells us that the observations are probably not independent of each other (violating one assumption of regression)

  • Mapping the residual values from the regression model allows us to observe any spatial clustering in the errors

  • Clustering of residuals could also indicate a violation of the independence assumption of errors
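
Residuals are just the gaps between observed and fitted values, so they can be stored alongside each ward and mapped or tested like any other variable. The sketch below fits a one-variable least-squares line by hand. The four data points are entirely hypothetical and are not taken from the London ward data.

```python
# Hypothetical wards: unauthorised absence (%) and average GCSE score
absence = [0.5, 1.0, 1.5, 2.0]
gcse = [350.0, 332.0, 310.0, 288.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(absence, gcse)

# Residual = observed - fitted; positive means the ward did better than predicted
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(absence, gcse)]

print(f"intercept={intercept:.1f}, slope={slope:.1f}")
print([round(r, 2) for r in residuals])
```

If positive residuals sit next to other positive residuals on the map, the model is systematically under-predicting in that part of the city, which is the pattern the Moran test on residuals is designed to detect.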


    Moran I test under randomisation

data:  LondonWardsMerged$model1_resids  
weights: nb2listw(LWard_nb)    

Moran I statistic standard deviate = 12.183, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.2862894906     -0.0016025641      0.0005583971 
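
The "standard deviate" in these test outputs is the observed Moran's I minus its expectation, divided by the square root of its variance. As a quick arithmetic check, the sketch below recomputes it from the figures printed in the two tests. The number of wards (n = 625) is taken from the spatial lag model output, and the helper name is hypothetical.

```python
import math


def standard_deviate(statistic, expectation, variance):
    """z-score of Moran's I relative to its expectation under no spatial pattern."""
    return (statistic - expectation) / math.sqrt(variance)


n = 625                    # number of wards reported in the model output
expected_i = -1 / (n - 1)  # expectation of Moran's I with no autocorrelation

z_scores = standard_deviate(0.4190507533, -0.0016025641, 0.0005594495)
z_residuals = standard_deviate(0.2862894906, -0.0016025641, 0.0005583971)

print(f"expectation:          {expected_i:.10f}")
print(f"z for GCSE scores:    {z_scores:.3f}")
print(f"z for OLS residuals:  {z_residuals:.3f}")
```

Both values are far larger than the roughly 1.64 needed for significance at the 5% level in a one-sided test, so the clustering in the scores and in the residuals is very unlikely to be due to chance.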

Dealing with Spatial Patterns - Spatial Regression Models

  • One way of coping with spatial dependence in the dependent variable is to include the spatial lag of that variable as an independent explanatory variable

  • Running the spatial lag model reveals that the spatial lag is statistically significant and has the effect of reducing the estimated impact of missing 1% of school days from -41 points to -31 points.


Call:
lagsarlm(formula = average_gcse_capped_point_scores_2014 ~ unauthorised_absence_in_all_schools_percent_2013, 
    data = LondonWardsMerged, listw = nb2listw(LWard_nb, style = "W"), 
    method = "eigen")

Residuals:
      Min        1Q    Median        3Q       Max 
-68.70402  -9.44615  -0.64207   8.53417  58.56788 

Type: lag 
Coefficients: (asymptotic standard errors) 
                                                 Estimate Std. Error z value
(Intercept)                                      207.4009    15.0053  13.822
unauthorised_absence_in_all_schools_percent_2013 -30.7843     2.0792 -14.806
                                                  Pr(>|z|)
(Intercept)                                      < 2.2e-16
unauthorised_absence_in_all_schools_percent_2013 < 2.2e-16

Rho: 0.46705, LR test value: 104.93, p-value: < 2.22e-16
Asymptotic standard error: 0.041738
    z-value: 11.19, p-value: < 2.22e-16
Wald statistic: 125.22, p-value: < 2.22e-16

Log likelihood: -2581.93 for lag model
ML residual variance (sigma squared): 217.21, (sigma: 14.738)
Number of observations: 625 
Number of parameters estimated: 4 
AIC: 5171.9, (AIC for lm: 5274.8)
LM test for residual autocorrelation
test value: 3.0949, p-value: 0.078537
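
Several figures in this output can be cross-checked with simple arithmetic. The sketch below uses only numbers printed above and rests on two assumptions that should be treated with caution. First, the "AIC for lm" is assumed to count three parameters (intercept, slope and error variance). Second, because the weights are row-standardised (style = "W"), the average total effect of absence once feedback through neighbouring wards is included is approximated by beta / (1 - rho), a standard result for spatial lag models. The spatialreg package provides an `impacts()` function for computing such effects properly.

```python
# Values copied from the lagsarlm output above
log_lik_lag = -2581.93
k_lag = 4      # parameters estimated in the lag model
aic_lm = 5274.8
k_lm = 3       # assumption: intercept, slope and error variance
rho = 0.46705
beta_absence = -30.7843

# AIC = 2k - 2 * log-likelihood
aic_lag = 2 * k_lag - 2 * log_lik_lag

# Recover the OLS log-likelihood from its AIC, then the likelihood-ratio statistic
log_lik_lm = (2 * k_lm - aic_lm) / 2
lr_statistic = 2 * (log_lik_lag - log_lik_lm)

# Approximate average total impact under row-standardised weights (assumption above)
total_impact = beta_absence / (1 - rho)

print(f"AIC (lag model):  {aic_lag:.1f}")       # reported: 5171.9
print(f"LR statistic:     {lr_statistic:.2f}")  # reported: 104.93 (differs slightly through rounding)
print(f"total impact:     {total_impact:.1f} points per percentage point of absence")
```

On this reading, the -31 figure is the direct coefficient on absence, while the spillover through neighbouring wards' scores adds to the overall association.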